The size diversity of the Pteridaceae family chloroplast genome is caused by overlong intergenic spacers

Background While the size of chloroplast genomes (cpDNAs) is often influenced by the expansion and contraction of inverted repeat regions and the enrichment of repeats, it is the intergenic spacers (IGSs) that appear to play a pivotal role in determining the size of Pteridaceae cpDNAs. This provides an opportunity to delve into the evolution of chloroplast genomic structures of the Pteridaceae family. This study added five Pteridaceae species, comparing them with 36 published counterparts. Results Poor alignment in the non-coding regions of the Pteridaceae family was observed, and this was attributed to the widespread presence of overlong IGSs in Pteridaceae cpDNAs. These overlong IGSs were identified as a major factor influencing variations in cpDNA size. In comparison to non-expanded IGSs, overlong IGSs exhibited significantly higher GC content and were rich in repetitive sequences. Species divergence time estimations suggest that these overlong IGSs may have already existed during the early radiation of the Pteridaceae family. Conclusions This study reveals new insights into the genetic variation, evolutionary history, and dynamic changes in the cpDNA structure of the Pteridaceae family, providing a fundamental resource for further exploring its evolutionary research. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-024-10296-0.


Introduction
Ferns are one of the oldest and most primitive vascular plant groups on Earth [1].They are a group of vascular plants with independent gametophyte and sporophyte generations, mainly undergoing sexual reproduction through spores.Pteridaceae is the second most generarich fern family.According to the Pteridophyte Phylogeny Group I (PPG I) classification, Pteridaceae contain five subfamilies, 53 genera, with an estimated 1,211 species contributing to about 10% of extant leptosporangiate fern diversity [2,3].These species have multiple values.For example, the Pteris species can accumulate arsenic, which is of great significance for the remediation of heavy metals in soil [4].Plenty of the Adiantum species can be used in medicine and are used in different parts of the world [5][6][7][8].Pteridaceae species have a cosmopolitan distribution concentrated in wet tropical and arid regions, occupying various ecosystems such as terrestrial, epiphytic, rupestral, and even aquatic [9].
In plants, chloroplasts are the site of photosynthesis and play an important role in the synthesis of defenserelated hormones, which sustain life on Earth [10,11].Chloroplasts also participate in some metabolic processes [11] and play important roles in plant adaptation to environmental stress [12][13][14].Chloroplasts typically possess independent genes and mechanisms for gene expression [15].The chloroplast genomes (cpDNAs) of land plants are typically 110-160 kb in size [16], usually divided into a large single-copy (LSC) region and a small single-copy (SSC) region by a pair of inverted repeats (IRa and IRb), forming a typical quadripartite structure [17].The cpDNA is mostly inherited from one parent, its structure is conservative, recombination is less, and the substitution rate is much lower than that of the nuclear genome [18][19][20].With the advancement of sequencing technology, cpDNA has become more accessible, therefore, it has become increasingly common to use chloroplasts to explore plant evolutionary events [21,22].
Comparing complete cpDNAs contributes to the study of mechanisms underlying genome evolution, revealing evolutionary relationships and phylogenies among species.For instance, lycophytes share a similar chloroplast gene order with mosses, while displaying an inverted gene order compared to all other vascular plants, providing evidence for the ancient evolution of early vascular land plants [23].Similarly, the expansion and contraction of the IR region often serve as evidence for interspecies phylogenetic relationships in chloroplast genome studies [24,25].In the context of evolution, the substantial loss of genes was initially linked to endosymbiotic events, but subsequent research indicates that gene loss independently occurred in different lineages [26].This suggests that a series of complex evolutionary constraints, selection, and convergence led to the conservation of chloroplast genome structure and content.For instance, the evolution of parasitic angiosperms has resulted in the relaxation of evolutionary constraints associated with the maintenance of photosynthetic functionality.Therefore, in the early stages of parasitic evolution, some photosynthetic genes (such as ndh-) were lost, leading to significant changes in chloroplast genome content [27].
Research has shown that the size of cpDNA is often influenced by changes in the IR boundaries [28].Furthermore, a high number of repetitive sequences in cpDNA has been recognized as a contributing factor to variations in genome size [29][30][31].In this study, the variations in sizes of Pteridaceae cpDNAs were ascribed to alterations in the length of overlong intergenic spacers (IGSs), with these IGSs exhibiting species-specific differences.Through sequencing the cpDNAs of five Pteridaceae species and comparing their structures with 36 other reported cpDNAs in the Pteridaceae family, this study aimed to uncover the evolutionary dynamics, genetic variations, and evolutionary relationships of cpDNAs among different species within the Pteridaceae family.

Sequence variation analysis
Multiple alignments of the 41 Pteridaceae cpDNAs revealed higher divergence in non-coding sequences than in coding regions (Figure S2).Particularly, IGS regions exhibited significant differentiation, while coding regions like matK, cemA, rpoC2 and ycf1 also showed variation.Overall, the IR region of the 41 Pteridaceae cpDNAs had the highest degree of conservation, while the single copy region had less conservation.Nucleotide diversity (Pi) values ranged from 0.006 to 0.376 for common genes and from 0 to 0.603 for common IGS regions.MatK, ndhF, ndhH -rps15, and trnL -ccsA showed notably higher Pi values, indicating substantial single nucleotide polymorphism (Figure S3).These markers could be utilized for distinguishing different species or populations, with matK already recognized as the core DNA barcode for ferns [32].
Dispersed repeats, predominantly forward and palindromic, were found in the cpDNAs of all 41 Pteridaceae species, with complement and reverse repeats observed in a few species (Table S2).Tandem repeats were identified in all other species except Adiantum nelumboides and Adiantum reniforme var.sinense (Table S3).Most repeats were within 100 bp, with some exceeding 200 bp (Fig. 1C).The majority of repeats were in the LSC (median: 48.80%) and IR regions (median: 54.21%), LSC: Large-single copy, SSC: Small-single copy, IR: Inverted repeat compared to the SSC region (median: 7.84%) (Fig. 1D).

Expansion and contraction of IR boundary analysis
The IR/SC boundary genes of the 41 Pteridaceae species had varying degrees of expansion and contraction.No similar patterns were found among the IR/SC boundaries in five different subfamilies (Fig. 2).The genes located at the IR/SC boundaries of these species, primarily included rpl23, trnI-CAU, trnT-UGU, trnR-ACG, ndhF, chlL, trnN-GUU, and ndhB.The IR/SSC boundary genes in these species were consistent, with only slight displacement near the boundary.The main reason for the differences was the inversion of the SSC region.In contrast, the LSC/IR boundary underwent much greater changes, such as trnI-CAU of Adiantum malesianum, Ceratopteris cornuta, Ceratopteris thalictroides and Vaginularia trichoidea all entering the IRb region; while the trnI-CAU of other species was located in the LSC region or on the LSC/IRb boundary.Another reason for differences in IR/ LSC boundary genes was the absence of trnT-UGU in some species.

The relationship between overlong IGS and CpDNA size
Overlong IGSs seemed to be common in the Pteridaceae cpDNAs, but no reliable patterns of occurrence were found between subfamilies or genera (Fig. 2).In this study, within the same IGS across different species, lengths greater than the mean of these IGSs were defined as overlong IGS.By comparing the length of the positions with overlong IGS in the Pteridaceae cpDNAs (Table 2), it was found that they frequently occurred in the rpoB -trnD region of the LSC and the rps12 -rrn16 region of the IRs.Additionally, in some species, the trnD -trnY  (LSC), ndhC-trnV (LSC), psbE -petL (LSC), and rps15 -ycf1 (SSC) IGSs also had longer lengths.Chloroplast genes were usually relatively conservative, so this random change of overlong IGSs was likely the main reason for the difference in cpDNA sizes of Pteridaceae species (Fig. 3A).For instance, Hemionitis subcordata, Myriopteris scabra, and Pteris semipinnata had larger cpDNAs, and their interiors contained overlong IGS regions.The lengths of the LSC, SSC, and IR regions of these species were separately calculated, and it was found that most of the factors contributing to the differences in these region lengths were largely due to the presence of these overlong IGSs (Fig. 3B).Moreover, the length of IGSs in each species was linearly related to their cpDNA sizes (Fig. 3C), and there was a highly significant positive correlation (r = 0.819, p = 5.798e-11 < 0.001).

Characteristics and analysis of overlong IGSs
This study analyzed the GC content of these sequences and found that the overall GC content of cpDNAs was less affected by these overlong IGSs.However, specific IGSs encompassing both overlong and non-overlong situations exhibited a significant difference (p = 1.189e-11 < 0.001); overlong IGSs tended to exhibit higher GC content (Fig. 4A).These expanded IGSs exhibited collinearity across diverse intergenic regions in various species (Figure S4).Upon conducting homologous sequence alignment of these elongated IGSs in the NCBI database, it was found that the majority of these homologous sequences originated from fern cpDNAs.In certain overlong IGSs, such as rps12-rrn16, alignments were observed with sequences from mitochondrial genomes of Haplopteris ensiformis (Pteridaceae), suggesting that they may transfer within organelles through mechanisms such as gene transfer or horizontal gene transfer.In addition, repetitive sequences and transposable elements located within these IGS were screened.The results of the Mann-Whitney U test revealed that compared to non-overlong IGSs, there were significantly higher numbers of SSRs (p = 0.016), tandem repeats (p = 2.2e-16 < 0.001), and dispersed repeats (p = 2.2e-16 < 0.001) in overlong IGSs (Fig. 4B).Regarding the length relationship between repetitive sequences and IGSs, although not statistically significant, a strong positive correlation was observed between tandem repeats and dispersed repeats with the expansion of the IGSs (r = 0.77 and 0.72, respectively) (Fig. 4C).For transposable elements, relevant sequences could not be retrieved structurally, but similar short fragments of different types of transposable elements were identified in A. malesianum (Gypsy, 48 bp) and Pteris arisanensis (Copia, 65 bp) (Table 3).

Phylogenetic relationship and divergence time estimate
The BI tree and ML tree, constructed using the common protein-coding sequences of all species, were consistent (Fig. 5)

Discussion
This study sequenced the cpDNA structures of five Pteridaceae species and examined those of 41 species, covering all subfamilies.They exhibited typical quadripartite structures, with genome sizes ranging from 145,327 bp to 165,631 bp and GC contents between 36.7% and 45.3%.Upon re-aligning and completing missing gene annotations, higher gene losses were observed in P. bipinnata var.bipinnata (7 chloroplast genes lost) and A. speciosum (9 lost) among the Pteridaceae cpDNAs.Additionally, trnR-UCG and trnT-UGU were frequently lost among the 41 Pteridaceae cpDNAs, along with a common loss of ycf94.The alignment of all 41 Pteridaceae cpDNAs revealed poor alignment in non-coding regions,  especially in the IGS regions.Four regions with significantly higher Pi values compared to other genes/IGSs were identified: matK, ndhF, ndhH -rps15, and trnL -ccsA.Among these, matK has been used as a core DNA barcode for ferns, and the other three markers may serve as candidate DNA barcodes for species within the Pteridaceae family.
Repetitive sequences can be dispersed widely or found in simple tandem arrays.SSRs, also known as microsatellites, consist of 1-6 nucleotide tandem repeat motifs and are distributed throughout the genome [33,34].SSRs are highly polymorphic and specific, making them valuable for studying molecular evolution, genetic diversity, and developing molecular markers [35,36].The diversity in repeat length, copy number, and distribution within species is attributed to slipped-strand mispairing during DNA replication on a single strand [37,38].Mononucleotide repeats, especially A/T motifs, were the most common in this study.A potential reason for the higher frequency of A/T repeats is that during chloroplast genome replication, the separation of AT strands is relatively easier compared to GC, which increases slip mismatching [39].In the Pteridaceae cpDNAs, SSRs were mainly located in the LSC region (38.71-79.73%)and tended to occur in IGS (54.84-91.18%),possibly due to stronger constraints in coding regions [40].Short repeat units can also be further extended into longer tandem repeats through slipped-strand mispairing or recombination [41][42][43], with the number of tandem repeats varying due to susceptibility to slippage events during DNA replication [44].Dispersed repeats are often associated with and contribute significantly to the chloroplast genome rearrangement in plants [25,45].Here, all species except A. nelumboides and A. reniforme var.sinense exhibited tandem and dispersed repeats, primarily less than 100 bp in size, with forward and palindromic repeat motifs predominating.Furthermore, these repeats were more prevalent in the IGS regions.
The substitution rate in chloroplast IR region genes is significantly lower than that in the SC region, thus greater conservation in the IR region [46].However, structural variations in the IR/SC boundary regions are still common [47][48][49].Among the 41 Pteridaceae cpD-NAs, varying degrees of IR/SC boundary expansion and contraction were observed, even within the same genus.The IR/SSC boundary genes remained consistent in the Pteridaceae cpDNAs, with differences primarily attributed to SSC region inversions.In contrast, the LSC/IR boundary varied more due to changes in the trnI-CAU position and the absence of trnT-UGU in some species.The variation in cpDNA size is often associated with changes in the IR/SC boundary [50][51][52] and the expansion of repetitive sequences [30,31].In this study, the movement in the IR/SC boundary genes of Pteridaceae cpDNA only led to minor differences in cpDNA size.For instance, the cpDNA size of H. subcordata was the largest, with a longer IR region due to the expansion of the rps12 -rrn16 IGS, rather than significant expansion of its IR/SC boundaries.A strong correlation between cpDNAs and IGSs was observed, and there was a common occurrence of overlong IGSs in species within this family.These overlong IGSs consistently aligned with the changes in the Pteridaceae cpDNA size, implying their primary influence on cpDNA size and their potential role in driving cpDNA structure evolution [53].In cases like A. malesianum, overlong IGS amplifies cpDNA size and triggers sequential movement of LSC region genes, affecting IR/SC boundaries [48].
The overlong IGSs prevalent in the Pteridaceae family were found in various intergenic regions across different species and showed a degree of collinearity (Figure S4).Mobile elements are present in the fern cpDNAs and are often found near genome inversion sites [53].In this study, only a few inversions occurred in the Pteridaceae cpDNAs, such as the ndhJ -psbE, the rrn5 -rrn16, and the SSC region.Some overlong IGSs were also found near inversion sites, such as rrn16 -rps12, which may have served as hotspots for IGS expansion.Additionally, the psbE -petL IGS of Pentagramma triangularis also underwent expansion.Within the same IGS, the GC content of overlong IGSs that underwent expansion was consistently higher, showing a significant difference compared to non-overlong IGSs.An important characteristic of GC base pairs is their higher thermal stability compared to AT base pairs [54].These interactions appear to be crucial for the overall structural stability of DNA and RNA transcripts [55,56].Significant differences in GC content exist among different genomes and within different regions of genomes.Some studies suggest a correlation between GC content and the length of coding genes, where the length of exons often increases with higher GC content [57].This is because stop codons are rich in AT, consequently resulting in a lower frequency of stop codon occurrence in GC-rich exons [58].The increase in GC content may also be attributed to the presence of more GC-rich sequences within these overlong IGSs, such as repetitive sequences.In the Pteridaceae cpDNAs, the overlong IGSs contained significantly more repetitive sequences, especially tandem repeats and dispersed repeats; meanwhile, these repeats had a strong positive correlation with the expansion of IGSs (r = 0.77 and 0.72, respectively), although statistical significance was not achieved.This suggests that repetitive sequences may promote the occurrence of chloroplast genome structural variation (SV).For instance, the location of SV in the Carex cpDNAs is closely related to the location of long repeats [59].The amplification of the Cyripedium cpDNAs is associated with a surge in ATbiased repeats [30].In addition, similar fragments were observed in the mitochondrial genomes of H. ensiformis and detected transposable element-like fragments in a few species, suggesting that they may transfer among different organelles.
According to this study, the Pteridaceae family had clear boundaries in both subfamilies and genera.The Pteris and Adiantum were both monophyletic, consistent with previous research [2,[60][61][62].Based on fossil evidence, ferns are believed to have originated in the Devonian period [63], and their dominance continued into the Paleozoic era [64].Here, the MCMCTree model suggested that the divergence of the Pteridaceae family from the outgroup occurred during the Jurassic period (∼ 180.72 Mya).Fossil records from the Jurassic period indicate significant fern evolution [65,66], with favorable climate and environmental conditions contributing to their survival and reproduction during this time.As a result, ferns occupied a crucial ecological niche on Earth during this period, evolving a wide range of morphological and ecological characteristics that had a significant impact on the evolution and diversity of terrestrial ecosystems [67].The J/K boundary period represents a time of environmental upheaval, characterized by intense transgressive phases due to rapidly changing sea levels [68].The subfamily of Parkerioideae represents an aquatic branch, with its species thriving in wet aquatic environments [69,70], diverged around the J/K boundary period (∼ 142.49Mya) and possibly underwent adaptive evolution.In addition, the divergence of the genus Acrostichum, within Parkerioideae, occurred around the Late Cretaceous (∼ 78.78 Mya), overlapping with the fossil record of this genus [71].During the Late Cretaceous period, the genus of this family began to rapidly diverge.Overlong IGSs were present during the early divergence stages of the Pteridaceae family, indicating that this structural feature may have an ancient origin in related species.As species rapidly diversified, the prevalence of these overlong IGSs gradually increased.

Conclusion
This study offers comprehensive insights into Pteridaceae cpDNAs.Chloroplast gene numbers were mostly stable, except for P. bipinnata var.bipinnata and A. speciosum.Changes in LSC/IR boundaries resulted from trnI-CAU movement and trnT-UGU deletion.SSC/IR boundary shifts were mainly due to SSC region inversion.The Pteridaceae cpDNAs often had overlong IGSs, increasing non-coding region variability and affecting cpDNA size changes.These overlong IGSs had higher GC content and were rich in repetitive sequences.Divergence time analysis traced Pteridaceae separation to the Jurassic (∼ 180.72 Mya), with rapid diversification within the genera beginning in the Late Cretaceous period.Additionally, overlong IGSs may have already existed during the early differentiation stages of this family.This study provides further theoretical support for the classification of Pteridaceae species, genetic diversity, and the evolution of genomic structure.

Plant materials, DNA extraction and De Novo sequencing
Obtaining a more complete chloroplast genome helps to understand their structural evolution.This study added cpDNAs of species Pteris ensiformis, P. arisanensis, Taenitis blechnoides, Adiantum flabellulatum and A. malesianum.Fresh leaves of the first three were sampled from the campus of Shenzhen Fairy Lake Botanical Garden [72].Fresh leaves of the latter two were sampled from the campus of South China Agricultural University (SCAU) [48].The plant materials used in the study were identified by Ting Wang and deposited in the Herbarium of SCAU with specimen numbers GXL20210901 (A. flabellulatum), GXL20210902 (A. malesianum), GXL20210903 (P.ensiformis), GXL20210904 (P.arisanensis), and GXL20210905 (T.blechnoides).DNA was extracted from the samples using a Tiangen Plant Genome DNA Kit (Tiangen Biotech Co., Ltd., Beijing, China) according to the manufacturer's instructions.The Illumina Nova-Seq6000 platform was used for sequencing.

Comparative genome and boundary regions analysis
The complete cpDNAs of 36 Pteridaceae species were downloaded from GenBank (Table 1).Combining our five sequenced species, a total of 41 Pteridaceae species were examined, covering all subfamilies.The accuracy of gene annotation for these 41 Pteridaceae cpDNAs was rechecked, and local BLAST was used for homologous sequence retrieval to complete some missing annotation genes (Figure S1).The boundary region of the 41 Pteridaceae cpDNAs was rechecked using Geneious [77] and displayed using Adobe Illustrator 2020, to better observe the expansion/contraction of IR regions.

Repetitive sequence analyses
The Perl script MISA (http://pgrc.ipk-gatersleben.de/misa/) [78] was used with the filter thresholds set to detect SSRs.The following parameters were set: a minimal repeat number of 10 for mononucleotide repeats, 5 for di-, 4 for tri-, and 3 for tetra-, penta-, and hexanucleotide SSRs.Tandem repeats were found with the tandem repeats finder (TRF) using default parameters [79].To identify complex repetitive sequences such as forward, reverse, complement and palindromic, REPuter online software [80] was used with a minimum repeat size of 30 bp and 90% sequence identity (Hamming distance of 3).The transposable elements were retrieved using RepeatMasker [81], with the rmblast database selected and the reference species aligned to "viridiplantae", with all other parameters set to default.

Sequence divergence analysis
The comparative analysis was carried out by using the shuffle-LAGAN mode in mVISTA online tool [82] to analyze the cpDNA divergence of the 41 Pteridaceae species, with A. capillus-veneris (NC_004766.1) as a reference.Extracted the common genes and IGSs of these cpDNAs as independent datasets, aligned each dataset in MAFFT v7.475 using default parameters [83], and calculated nucleotide diversity in DnaSP v6.0 [84].

Phylogenetic analysis and divergence time estimates
Phylogenetic reconstruction of the above 41 Pteridaceae species with Gymnosphaera metteniana and Alsophila spinulosa as outgroups.76 common but not repetitive protein-coding sequences of these species were retained, and MAFFT was used to perform sequence alignment, and remove 90% of the gaps in each multi-alignment sequence using trimAl [85].PhyloSuite [86] was used to concatenate these sequences into a dataset for phylogenetic analysis.The ML tree was inferred using RAxML [87], GTRGAMMAI was selected as the nucleotide substitution model.The Bayesian inference (BI) tree was established by MrBayes [88] and was estimated by running 2,000,000 generations (Nst = 6, rates = invgamma).
In this study, the differentiation time estimated by TimeTree [89]  Inferring the time tree of Pteridaceae using the MCMCTree software package of PAML [90], the model was set to GTR, and the MCMC procedures had a burn-in of 2,000 iterations and then ran for 20,000 iterations.MCMCTree analysis was performed twice, which generated similar results, confirming the robustness of the analysis.The final tree was visualized and edited in FigTree v.1.4.3 (http://tree.bio.ed.ac.uk/software/figtree/).

Fig. 1 Fig. 2
Fig. 1 Comparison of repetitive sequences among the 41 Pteridaceae cpDNAs.(A) The number of SSRs among each species.(B) The percentage of SSRs located in different cpDNA regions and gene sequence regions.(C) The size distribution of dispersed repeats and tandem repeats among the 41 Pteridaceae cpDNAs.(D) The percentage of dispersed repeats and tandem repeats located in different cpDNA regions and gene sequence regions Fig. 2 Comparison of IR/SC boundaries among the 41 Pteridaceae cpDNAs.The numbers above, below, or adjacent to genes represent gene length or the distances from the front or end of genes to the boundary sites.Figure features are not to scale

Fig. 3
Fig. 3 Comparison of chloroplast genome features in the 41 Pteridaceae species.(A) Comparison of the cpDNA sizes with the overlong IGSs; even if the overlong IGS is in the IR regions, the figure only shows the length of one copy of the IGS.(B) Comparison of the lengths of the LSC, SSC, and IR regions; with * indicating regions containing overlong IGS.(C) The correlation between IGS length and cpDNA sizes

Fig. 5
Fig. 5 Phylogenetic relationship (right) and divergent time estimate (left) of the 41 Pteridaceae species.The mean divergence time of the nodes is shown next to the nodes while the blue bars correspond to the 95% highest posterior density (HPD).The red dots represent species within the branch that contain overlong IGS.Bootstrap value/posterior probabilities < 100%/1 are displayed on the branches

Table 1
CpDNA features of the 41 Pteridaceae species

Table 3
The homologous fragments of transposable elements contained in the overlong IGSs